Feeling a little bewildered. Lots to take in. I had forgotten the syntax of R; it has been years since I last used it. The introduction through the book was thorough, but reading is not the same as hands-on work when it comes to coding. At least the book will serve as a good reference source in the future.
https://github.com/lileinon/IODS-project
# This is a so-called "R chunk" where you can write R code.
date()
## [1] "Mon Nov 28 22:52:24 2022"
The text continues here.
This week we dug into regression analysis and model validation. Both very useful and interesting, and, I have to say, somewhat foreign to me. Too long has passed since I last dabbled in statistics. Much reading was needed to grasp the concepts well enough. Overall I think this week gave me a good basic understanding of how to conduct regression analysis with RStudio.
date()
## [1] "Mon Nov 28 22:52:24 2022"
This week’s dataset comes from Kimmo Vehkalahti. It contains data he collected in a survey of university students on an introductory social sciences course, conducted between Dec. 2014 and Jan. 2015. The survey covered 183 students and includes 56 questionnaire variables, 1 compound variable (attitude), and three contextualizing variables (age, gender, points in final exam), bringing the total to 60 columns.
This data was cleaned and made uniform as follows:

- The survey measured three different approaches to learning, represented by answers in the appropriate columns. These were merged into three compound columns corresponding to the approaches: deep, surface (surf) and strategic (stra). The value of each new column is the mean of the answers pertinent to that category (original answers were on a scale of 1 to 5).
- The “attitude” compound column was scaled to the same range (1–5).
- The now superfluous raw survey columns were removed, leaving 7 columns: age, gender, points, attitude, deep, surf and stra.
- Rows with a value of zero in the “points” column (17 in total) were removed, leaving 166 rows. Since the aim was to measure the impact of studying attitudes and approaches to learning, with exam success as the main outcome, the rows without exam results were redundant.
- As a minor curation step, the column names were normalized.
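The compound-column step above can be sketched in a few lines of base R. This is a toy illustration only: the column names below are hypothetical stand-ins, not the exact question lists used in the real wrangling script.

```r
# Toy data frame standing in for the raw survey (column names are illustrative)
raw <- data.frame(D03 = c(4, 2), D11 = c(5, 3),   # "deep" questions, scale 1-5
                  SU02 = c(2, 4), SU10 = c(1, 5), # "surface" questions
                  Attitude = c(37, 21),           # sum of 10 questions, max 50
                  Points = c(25, 0))

# each compound value is the row mean of its questions, staying on the 1-5 scale
raw$deep <- rowMeans(raw[, c("D03", "D11")])
raw$surf <- rowMeans(raw[, c("SU02", "SU10")])

# the attitude sum is rescaled back to 1-5 by dividing by the question count
raw$attitude <- raw$Attitude / 10

# drop rows with zero exam points
cleaned <- raw[raw$Points > 0, ]
```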
In the following code the .csv file containing the cleaned data is read into the variable “learning2014”, and its dimensions (dim) and structure (str) are checked to make sure everything is as it should be.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 0.3.5
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
setwd("data")
learning2014 <- read_csv("learning2014.csv")
## Rows: 166 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): gender
## dbl (6): age, attitude, deep, stra, surf, points
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
dim(learning2014)
## [1] 166 7
str(learning2014)
## spc_tbl_ [166 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ gender : chr [1:166] "F" "M" "F" "M" ...
## $ age : num [1:166] 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num [1:166] 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num [1:166] 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num [1:166] 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num [1:166] 2.58 3.17 2.25 2.25 2.83 ...
## $ points : num [1:166] 25 12 24 10 22 21 21 31 24 26 ...
## - attr(*, "spec")=
## .. cols(
## .. gender = col_character(),
## .. age = col_double(),
## .. attitude = col_double(),
## .. deep = col_double(),
## .. stra = col_double(),
## .. surf = col_double(),
## .. points = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
The data and its seven main variables can be summarized as follows:
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(ggplot2)
p <- ggpairs(learning2014, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
p
summary(learning2014)
## gender age attitude deep
## Length:166 Min. :17.00 Min. :1.400 Min. :1.583
## Class :character 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:3.333
## Mode :character Median :22.00 Median :3.200 Median :3.667
## Mean :25.51 Mean :3.143 Mean :3.680
## 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:4.083
## Max. :55.00 Max. :5.000 Max. :4.917
## stra surf points
## Min. :1.250 Min. :1.583 Min. : 7.00
## 1st Qu.:2.625 1st Qu.:2.417 1st Qu.:19.00
## Median :3.188 Median :2.833 Median :23.00
## Mean :3.121 Mean :2.787 Mean :22.72
## 3rd Qu.:3.625 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :5.000 Max. :4.333 Max. :33.00
The data includes more female (110) than male participants, and skews young (median age 22 years). The strongest correlations with exam points are found for attitude and the “strategic” approach to learning. Male participants show somewhat higher attitude scores. Age correlates best with the “strategic” approach. The three variables “deep”, “stra” and “attitude” show positive interrelation. We will choose these three as explanatory variables, with points as the target variable.
my_model <- lm(points ~ attitude + stra + deep, data = learning2014)
summary(my_model)
##
## Call:
## lm(formula = points ~ attitude + stra + deep, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.5239 -3.4276 0.5474 3.8220 11.5112
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.3915 3.4077 3.343 0.00103 **
## attitude 3.5254 0.5683 6.203 4.44e-09 ***
## stra 0.9621 0.5367 1.793 0.07489 .
## deep -0.7492 0.7507 -0.998 0.31974
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.289 on 162 degrees of freedom
## Multiple R-squared: 0.2097, Adjusted R-squared: 0.195
## F-statistic: 14.33 on 3 and 162 DF, p-value: 2.521e-08
This summary shows the significance of each explanatory variable in relation to the target variable. First it lists the residuals, the differences between the model’s predictions and the actual values. Of the coefficient statistics, the interlinked t value and p value (Pr(>|t|)) are the most important. Looking at them, the variable “deep” does not appear to correlate with points (Pr(>|t|) = 0.31974) and is statistically non-significant. Thus we will leave it out of the equation and refit with two explanatory variables.
my_model2 <- lm(points ~ attitude + stra, data = learning2014)
summary(my_model2)
##
## Call:
## lm(formula = points ~ attitude + stra, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.6436 -3.3113 0.5575 3.7928 10.9295
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.9729 2.3959 3.745 0.00025 ***
## attitude 3.4658 0.5652 6.132 6.31e-09 ***
## stra 0.9137 0.5345 1.709 0.08927 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared: 0.2048, Adjusted R-squared: 0.1951
## F-statistic: 20.99 on 2 and 163 DF, p-value: 7.734e-09
A slightly better result. Attitude seems to be the single most significant factor explaining the exam result (p = 6.31e-09). Of the approaches to studying, “strategic” bears the most significance; its p-value of 0.089 verges on statistical significance but does not quite reach it. The adjusted R-squared value of 0.1951 tells us that c. 19.5% of the variation in exam points can be explained by attitude and adherence to the strategic approach to learning, with attitude accounting for almost all of this effect. The model’s p-value is 7.734e-09, highly significant, and the F-statistic of 20.99 is considerably larger than the critical F on 2 and 163 degrees of freedom (~3.05). Therefore we can reject the null hypothesis.
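The critical F value quoted above does not appear in the model summary; it can be looked up with base R’s F quantile function, as in this quick sketch:

```r
# critical F at the 5% significance level for 2 and 163 degrees of freedom;
# the observed F-statistic (20.99) is far above this, so we reject the null
crit <- qf(0.95, df1 = 2, df2 = 163)
crit  # roughly 3.05
```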
What remains are the regression diagnostics, shown below in three plots.
par(mfrow = c(2,2))
plot(my_model2, c(1, 2, 5))
The Residuals vs Fitted plot confirms that the fitted model is appropriate: the residuals and the fitted values are uncorrelated. The Normal Q-Q plot shows no evidence of a major departure from normality, confirming that the residuals are reasonably normally distributed and not skewed. Residuals vs Leverage shows no data points falling outside Cook’s distance, and therefore there are no overly influential individual data points. Everything checks out.
This week was all about the title. The statistics were interesting and the tools powerful. Should be useful in the future. The only problem I continuously have with R is that every function seems to be like a “magic box”. I just put in some numbers, and it does its magic in secret, spewing out magnificent graphs and whatnot, but I always feel like I have no clue or maybe even control how it does what it does. Like driving with an autopilot: just enter the destination, and the car takes care of everything. Can I say that I drove there? Or even that I can drive, if all I do is input destinations? Each function works differently; this seems not so much about learning general R syntax, but instead learning which function does what, and what you need to feed it. Perhaps the feeling ensues because earlier on I have mostly done everything step-by-step by myself, each graph and such. This feels almost like cheating at times, and on the other hand like giving away control.
date()
## [1] "Mon Nov 28 22:52:33 2022"
Also, loading all the needed libraries here.
library(scales); library(dplyr); library(tidyverse); library(ggplot2); library(boot)
##
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
Okey dokey. This week’s dataset comes from Portugal. It was collected by P. Cortez and A. Silva for a paper published in 2008, and the raw data was made available in the UCI Machine Learning Repository in 2014 (link, with info on the variables etc.). Originally the dataset consisted of two distinct but overlapping questionnaires conducted in a math class and a Portuguese class. The questionnaires gathered plenty of background information on the students, e.g. their domestic situation, siblings, the occupations of their parents, and the student’s alcohol consumption (full list of column names below; only the chosen variables are explained in detail). As target variables the data included three grades: G1 and G2, the grades of the first two trimesters, and G3, the grade for the third trimester and simultaneously for the whole course. All in all the data consists of 33 variables, of which 27 are contextualizing variables describing the student’s background and social status, and 6 are variables connected with how well they did in class (the three grade columns; absences; previous failures of the course; and whether or not the student had taken extra paid classes on the course subject).
The dataset included 395 math students and 649 Portuguese students. Although anonymized, the background questions allowed identifying students who had answered both questionnaires, and selecting only them. This left us with 370 students. Their grades were averaged into a single number. The questionnaire’s two columns on alcohol consumption (graded on a scale of 1–5) were combined into one column, alc_use, and students whose consumption grade surpassed 2 were marked in a separate high_use column. The aim of this analysis is to study the relationship between high/low alcohol consumption and some of the other variables in the data.
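The alcohol-consumption step described above can be sketched as follows. This is a minimal illustration on toy data, assuming (as in the UCI documentation) that Dalc and Walc are the weekday and weekend consumption columns on a 1–5 scale:

```r
# Toy data standing in for the combined questionnaire
alc_toy <- data.frame(Dalc = c(1, 3, 2), Walc = c(1, 5, 2))

# alc_use is the mean of the two consumption columns
alc_toy$alc_use <- (alc_toy$Dalc + alc_toy$Walc) / 2

# high_use flags students whose mean consumption surpasses 2
alc_toy$high_use <- alc_toy$alc_use > 2
```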
setwd("data")
alc <- read_csv("alc.csv", show_col_types = FALSE)
colnames(alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "guardian" "traveltime" "studytime" "schoolsup"
## [16] "famsup" "activities" "nursery" "higher" "internet"
## [21] "romantic" "famrel" "freetime" "goout" "Dalc"
## [26] "Walc" "health" "failures" "paid" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
The 4 variables I have chosen to be analysed in relation to alcohol consumption are: going out with friends (goout), absences from school (absences), extra-curricular activities (activities), and weekly study time (studytime).
Going out a lot with friends would probably correlate most strongly with high alcohol consumption. High alcohol consumption could also cause absences from classes. The last two, on the other hand, are such that they would limit one’s ability to partake in sociable drinking.
This will be a “preliminary” probe, with brief overviews on each chosen variable.
Absences is a numeric count of absences, and high_use is binary. Correlation, if any, should be visible from a simple boxplot.
plot_absences <- ggplot(alc, aes(x = high_use, y = absences))
plot_absences + geom_boxplot() + ggtitle("1. Amount of absences vs. high use of alcohol")
The hypothesis seems to be correct, and the correlation significant.
Going out was graded on a scale from 1 to 5, and high_use is binary. A barplot should reveal whether there are any differences between the high-user and low-user groups.
plot_goout <- ggplot(alc, aes(x = goout))
plot_goout + geom_bar() + facet_wrap("high_use", labeller = label_both) + ggtitle("2. going out with friends vs. high alcohol usage")
As visible from the barplot, the high users go out a lot more than the low users: their barplot leans heavily to the right. A simple check of mean values and distribution percentages confirms the hypothesis.
alc %>% group_by(high_use) %>% summarise(count = n(), mean_goout = mean(goout))
## # A tibble: 2 × 3
## high_use count mean_goout
## <lgl> <int> <dbl>
## 1 FALSE 259 2.85
## 2 TRUE 111 3.73
alc %>% group_by(high_use, goout) %>% summarise(count = n()) %>% ungroup() %>% group_by(high_use) %>% mutate(percentage = percent(count/sum(count)))
## `summarise()` has grouped output by 'high_use'. You can override using the
## `.groups` argument.
## # A tibble: 10 × 4
## # Groups: high_use [2]
## high_use goout count percentage
## <lgl> <dbl> <int> <chr>
## 1 FALSE 1 19 7.34%
## 2 FALSE 2 82 31.66%
## 3 FALSE 3 97 37.45%
## 4 FALSE 4 40 15.44%
## 5 FALSE 5 21 8.11%
## 6 TRUE 1 3 2.7%
## 7 TRUE 2 15 13.5%
## 8 TRUE 3 23 20.7%
## 9 TRUE 4 38 34.2%
## 10 TRUE 5 32 28.8%
Extra-curricular activities is a binary value, as is high_use. A simple check of the counts should give us an overview.
alc %>% group_by(high_use, activities) %>% summarise(count = n()) %>% ungroup() %>% group_by(high_use) %>% mutate(percentage = percent(count/sum(count)))
## `summarise()` has grouped output by 'high_use'. You can override using the
## `.groups` argument.
## # A tibble: 4 × 4
## # Groups: high_use [2]
## high_use activities count percentage
## <lgl> <chr> <int> <chr>
## 1 FALSE no 120 46.3%
## 2 FALSE yes 139 53.7%
## 3 TRUE no 59 53.2%
## 4 TRUE yes 52 46.8%
Perhaps surprisingly, there is no significant difference in extracurricular activities between the two groups; both basically break even. Further inquiries regarding this variable would be futile.
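As a quick formal check (not part of the original analysis), a chi-squared test of independence on the same cross-tabulation supports the “break even” reading; with counts this close to their expected values, a large p-value is to be expected. The counts below are taken from the table printed above.

```r
# 2x2 table of high_use vs. activities, counts from the summarise() output above
tab <- matrix(c(120, 139, 59, 52), nrow = 2, byrow = TRUE,
              dimnames = list(high_use = c("FALSE", "TRUE"),
                              activities = c("no", "yes")))

# chi-squared test of independence (Yates continuity correction by default)
res <- chisq.test(tab)
res$p.value  # well above 0.05, so no evidence of association
```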
Weekly study time is a numeric field (values: 1 = <2 hours, 2 = 2 to 5 hours, 3 = 5 to 10 hours, 4 = >10 hours). A barplot should do the trick for an overview of the correlation.
plot_studytime <- ggplot(alc, aes(x = studytime))
plot_studytime + geom_bar() + facet_wrap("high_use", labeller = label_both) + ggtitle("4. Weekly study time vs. high alcohol usage")
alc %>% group_by(high_use, studytime) %>% summarise(count = n()) %>% ungroup() %>% group_by(high_use) %>% mutate(percentage = percent(count/sum(count)))
## `summarise()` has grouped output by 'high_use'. You can override using the
## `.groups` argument.
## # A tibble: 8 × 4
## # Groups: high_use [2]
## high_use studytime count percentage
## <lgl> <dbl> <int> <chr>
## 1 FALSE 1 56 21.6%
## 2 FALSE 2 128 49.4%
## 3 FALSE 3 52 20.1%
## 4 FALSE 4 23 8.9%
## 5 TRUE 1 42 37.8%
## 6 TRUE 2 57 51.4%
## 7 TRUE 3 8 7.2%
## 8 TRUE 4 4 3.6%
As can be seen, almost 90% of high alcohol users studied less than 5 hours per week, whereas the low alcohol users are spread more evenly across the categories, with much more time spent on studies. The correlation is strong here.
Okey dokey. First the code, then the analysis.
m <- glm(high_use ~ absences + goout + activities + studytime, data = alc, family = "binomial")
summary(m)
##
## Call:
## glm(formula = high_use ~ absences + goout + activities + studytime,
## family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9196 -0.7732 -0.5083 0.8446 2.5676
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.34944 0.53871 -4.361 1.29e-05 ***
## absences 0.06961 0.02213 3.145 0.00166 **
## goout 0.73572 0.11959 6.152 7.64e-10 ***
## activitiesyes -0.31151 0.25661 -1.214 0.22476
## studytime -0.56049 0.16914 -3.314 0.00092 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 375.68 on 365 degrees of freedom
## AIC: 385.68
##
## Number of Fisher Scoring iterations: 4
# compute odds ratios (OR)
OR <- coef(m) %>% exp
# compute confidence intervals (CI)
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.09542238 0.03227113 0.2680871
## absences 1.07208689 1.02777425 1.1223836
## goout 2.08697580 1.66084887 2.6570839
## activitiesyes 0.73234019 0.44132598 1.2094464
## studytime 0.57093162 0.40522569 0.7880221
Absences, going out and study time are statistically significant (p-values 0.00166, 7.64e-10 and 0.00092 respectively). Activities, on the other hand, is statistically insignificant, as noted earlier. The correlations were also correctly assumed in the preliminary hypotheses with regard to the three statistically significant variables.
The OR and CI show that each absence corresponds to a 7 percent higher likelihood of belonging to the group of high users, with real-population values falling between 2.8 and 12.2 percent. Similarly, belonging to one higher study-time group meant a 43 percent lower likelihood of belonging to the high users, with real-population values between 59.5 and 21.2 percent. Belonging to a higher group in the going-out variable, on the other hand, meant a whopping 109 percent higher likelihood of being a high alcohol user, with real-population values between 66 and 165.7 percent. The CI of activities includes 1, which shows that its effect, positive or negative, cannot be estimated in the real population.
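The percent figures above come from the odds ratios via (OR − 1) × 100. A small sketch of this conversion, using the odds ratios printed above (the helper name is my own, not from the course material):

```r
# convert an odds ratio to the percent change in odds quoted in the text
or_to_percent <- function(or) (or - 1) * 100

or_to_percent(1.07208689)  # absences:  about +7 %
or_to_percent(0.57093162)  # studytime: about -43 %
or_to_percent(2.08697580)  # goout:     about +109 %
```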
We’ll use the statistically significant variables found: absences, goout and studytime.
m2 <- glm(high_use ~ absences + goout + studytime, data = alc, family = "binomial")
probabilities <- predict(m2, type = "response")
# add the predicted probabilities to 'alc2'
alc2 <- mutate(alc, probability = probabilities)
# use the probabilities to make a prediction of high_use
alc2 <- mutate(alc2, prediction = probability > 0.5)
# tabulate the target variable versus the predictions
table(high_use = alc2$high_use, prediction = alc2$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 239 20
## TRUE 63 48
g <- ggplot(alc2, aes(x = probability, y = high_use))
# define the geom as points and draw the plot
g + geom_point(aes(col = prediction))
# tabulate the target variable versus the predictions
table(high_use = alc2$high_use, prediction = alc2$prediction) %>% prop.table() %>% addmargins()
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.64594595 0.05405405 0.70000000
## TRUE 0.17027027 0.12972973 0.30000000
## Sum 0.81621622 0.18378378 1.00000000
# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc2$high_use, prob = alc2$probability)
## [1] 0.2243243
The loss function gives 22.4% falsely predicted individuals. The false positives and false negatives in the cross-tabulation give 17.0% + 5.4% = 22.4% of falsely predicted individuals, the same number. This is the total proportion of inaccurately classified individuals (= the training error).
Finally, let’s check whether this model has better test set performance (smaller prediction error using 10-fold cross-validation) than the model introduced in the Exercise Set, which had an error of about 0.26.
cv <- cv.glm(data = alc2, cost = loss_func, glmfit = m2, K = 10)
# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2351351
It seems this model has better test set performance (error 0.235) than the Exercise Set model (0.26).
Reflections: This week was horribly busy for me, and I also became ill. I barely had time to do the exercises and the assignments. I did skim through Kimmo’s book chapters, but I must say that the whole concept of how clustering works seems a bit …vague? Yes, it works, but it seems that every method involves either or both of 1) the researcher guessing where to cut the clusters, and 2) the clustering having such a random factor that the results can only be repeated by forcing a fixed random seed. Which sounds rather strange and does not make one feel that the process can even be firmly grasped. But I will try my best. As mentioned in the book, clustering is a basic concept of reasoning, and an absolutely necessary tool for data analysis.
date()
## [1] "Mon Nov 28 22:52:35 2022"
Done.
# I will load all the needed libraries and the dataset here for convenience
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(tidyverse)
library(corrplot)
## corrplot 0.92 loaded
data("Boston")
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506 14
glimpse(Boston)
## Rows: 506
## Columns: 14
## $ crim <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, 0.08829,…
## $ zn <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5, 12.5, 1…
## $ indus <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, 7.87, 7.…
## $ chas <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ nox <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524, 0.524,…
## $ rm <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172, 5.631,…
## $ age <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0, 85.9, 9…
## $ dis <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605, 5.9505…
## $ rad <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4,…
## $ tax <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311, 311, 31…
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, 15.2, 15…
## $ black <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60, 396.90…
## $ lstat <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.93, 17.10…
## $ medv <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9, 15…
The Boston dataset is a classic that ships with R as part of the MASS package. Its title, “Housing Values in Suburbs of Boston”, describes its origins well enough. The dataset was gathered for the 1978 article “Hedonic prices and the demand for clean air” (J. Environ. Economics and Management 5, 81–102) by D. Harrison and D.L. Rubinfeld. The 14 variables map factors that affected housing prices in the suburbs around Boston (called “towns” in the documentation; we’ll use the same term henceforth) and include information on, for instance, crime rate, the amount of nitrogen oxides in the air, and access to Boston’s radial highways. The data has already been cleaned; it comprises 506 rows of observations and 14 columns of variables.
The variables can be summarized as follows:
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
cor_matrix <- cor(Boston) %>% round(2)
corrplot(cor_matrix, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)
Many quite natural distributions, as with the airborne NOx gases. Yet some of the variables are uneven. The crime variable, for instance, ranges from 0.006 to 88.976, with a median of 0.26 and a mean of 3.61 — there seems to be huge variation (and a lot of outliers) in crime between towns, with a few of them accounting for the majority of the cases. Non-retail business, on the other hand, is divided more evenly, but wealth is clearly unevenly divided (lstat and medv). The black variable (a transformed measure of the proportion of Black residents by town) also seems likely to have many outliers (min 0.32, 1st qu. 375.38, mean 356.67). Some variables seem to be interconnected, judging by their similar spreads (for instance dis, rad, zn).
As can be seen, some of the values mentioned above are interconnected. Lower status (lstat) and median home value (medv) are negatively correlated to a very high degree. The number of rooms correlates positively with more expensive houses, as can be expected. Interestingly, the dis variable (distance to employment centres) reveals the societal structure and character of the economic centers: older housing and smaller lots on the zoned land; more (non-retail) businesses, which bring more customers and employees driving daily by car, and hence more nitrogen oxides in the air; and also more crime and more lower-status inhabitants. This dynamic works as a sort of heuristic key for reading the data. As another instance of the same phenomenon, the variable rad (“access to radial highways”) connects all the data points pertinent to the same dynamic: highways connect major centers. The crime rate is connected with this variable to a high degree for this reason.
As an additional check, I will draw boxplots of some of the variables mentioned in the summary analysis. The number of outliers makes it clear that we are dealing with data that is not evenly spread and should be checked for clusters. As a preliminary hypothesis we might say that the clusters would most likely form around the dynamic of more populous, dense commercial towns vs. smaller, less populous towns with much less crime, pollution, commerce, etc.
par(mfrow = c(1, 5))
boxplot(Boston$crim, main = "crim", col = "purple", outcol = "red")
boxplot(Boston$zn, main = "zn", col = "blue", outcol = "red")
boxplot(Boston$lstat, main = "lstat", col = "yellow", outcol = "red")
boxplot(Boston$medv, main = "medv", col = "orange", outcol = "red")
boxplot(Boston$black, main = "black", col = "green", outcol = "red")
Next, we will standardize the dataset and print out summaries of the scaled data. Scaling centers each variable to a mean of 0 and a standard deviation of 1.
boston_scaled <- as.data.frame(scale(Boston))
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
Then, as mentioned in the chapter title, we will create a categorical variable to replace the old crime rate with, along with the training and testing datasets.
boston_scaled$crim <- as.numeric(boston_scaled$crim)
bins <- quantile(boston_scaled$crim)
labels <- c("low", "med_low", "med_high", "high")
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = labels)
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
summary(boston_scaled)
## zn indus chas nox
## Min. :-0.48724 Min. :-1.5563 Min. :-0.2723 Min. :-1.4644
## 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723 1st Qu.:-0.9121
## Median :-0.48724 Median :-0.2109 Median :-0.2723 Median :-0.1441
## Mean : 0.00000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723 3rd Qu.: 0.5981
## Max. : 3.80047 Max. : 2.4202 Max. : 3.6648 Max. : 2.7296
## rm age dis rad
## Min. :-3.8764 Min. :-2.3331 Min. :-1.2658 Min. :-0.9819
## 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049 1st Qu.:-0.6373
## Median :-0.1084 Median : 0.3171 Median :-0.2790 Median :-0.5225
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617 3rd Qu.: 1.6596
## Max. : 3.5515 Max. : 1.1164 Max. : 3.9566 Max. : 1.6596
## tax ptratio black lstat
## Min. :-1.3127 Min. :-2.7047 Min. :-3.9033 Min. :-1.5296
## 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049 1st Qu.:-0.7986
## Median :-0.4642 Median : 0.2746 Median : 0.3808 Median :-0.1811
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332 3rd Qu.: 0.6024
## Max. : 1.7964 Max. : 1.6372 Max. : 0.4406 Max. : 3.5453
## medv crime
## Min. :-1.9063 low :127
## 1st Qu.:-0.5989 med_low :126
## Median :-0.1449 med_high:126
## Mean : 0.0000 high :127
## 3rd Qu.: 0.2683
## Max. : 2.9865
n <- nrow(boston_scaled)
ind <- sample(n, size = floor(n * 0.8))
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]
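The quantile-based binning used above can be illustrated with a minimal toy example (the vector here is made up for illustration; the labels mirror the ones used for crime):

```r
# Toy illustration of quantile-based binning with cut()
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
bins <- quantile(x)  # min, 25 %, 50 %, 75 %, max
f <- cut(x, breaks = bins, include.lowest = TRUE,
         labels = c("low", "med_low", "med_high", "high"))
table(f)  # two observations per class
```

Note that include.lowest = TRUE matters: without it the minimum value would fall outside the first interval and become NA.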
Next we’ll fit a linear discriminant analysis model on the training set. The categorical crime rate is the target variable, with all the other variables as predictors.
lda.fit <- lda(crime ~ ., data = train)
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1, 2)) {
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0,
         x1 = myscale * heads[, choices[1]],
         y1 = myscale * heads[, choices[2]],
         col = color, length = arrow_heads)
  text(myscale * heads[, choices], labels = row.names(heads),
       cex = tex, col = color, pos = 3)
}
classes <- as.numeric(train$crime)
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 2)
The variable rad has the highest correlation with the crime rate, and it is mainly through it that we are able to cluster the towns with the highest crime rates. The second and third most important factors are nox and zn, which essentially order the towns by population density.
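The arrow lengths in the biplot come directly from the discriminant coefficients in lda.fit$scaling (which is what coef() extracts inside lda.arrows above). As a self-contained illustration using the iris data that ships with R:

```r
library(MASS)
# LDA on iris: 4 predictors and 3 classes give a 4 x 2 coefficient matrix,
# one row per predictor, one column per discriminant (at most classes - 1)
fit <- lda(Species ~ ., data = iris)
dim(fit$scaling)
```

The same logic explains why lda.fit$scaling above is 13 x 3: thirteen predictors, four crime classes.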
Next we’ll use the test set to predict the crime rate classes, and cross-tabulate the predictions against the actual classes. First we’ll save the correct classes in their own variable.
correct_classes <- test$crime
test <- dplyr::select(test, -crime)
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 18 8 0 0
## med_low 2 20 4 0
## med_high 0 10 14 2
## high 0 0 0 24
As can be seen, the model predicts most of the cases correctly, with better accuracy at the low and high ends of the spectrum.
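To put a number on that claim, the overall accuracy is the share of the diagonal of the cross-tabulation. The matrix below is copied by hand from the output above:

```r
# Confusion matrix from the cross-tabulation above (rows = correct, cols = predicted)
cm <- matrix(c(18,  8,  0,  0,
                2, 20,  4,  0,
                0, 10, 14,  2,
                0,  0,  0, 24),
             nrow = 4, byrow = TRUE,
             dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                             predicted = c("low", "med_low", "med_high", "high")))
sum(diag(cm)) / sum(cm)  # overall accuracy: 76 / 102, about 0.75
```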
Following the assignment, we’ll first reload the dataset and standardize it so that each variable has mean zero and standard deviation one.
data(Boston)
boston_scaled <- scale(Boston)
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
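As a quick check that scale() really standardizes each column to mean zero and unit standard deviation, a tiny example:

```r
# Two columns on very different scales
m <- matrix(c(1, 2, 3, 10, 20, 30), ncol = 2)
sm <- scale(m)        # subtract each column mean, divide by each column sd
colMeans(sm)          # both effectively 0
apply(sm, 2, sd)      # both 1
```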
Next, we’ll compute the Euclidean distances between the observations, and check a summary of the values.
dist_eu <- dist(boston_scaled)
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
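dist() defaults to Euclidean distance between the rows of its input; a two-point sanity check on the classic 3-4-5 triangle:

```r
pts <- rbind(c(0, 0), c(3, 4))
dist(pts)                        # Euclidean: sqrt(3^2 + 4^2) = 5
dist(pts, method = "manhattan")  # |3| + |4| = 7
```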
Next we’ll run a k-means algorithm on the dataset.
set.seed(13)
km <- kmeans(boston_scaled, centers = 4)
pairs(boston_scaled, col = km$cluster)
Four clusters seems like a lot. Let’s try to gauge the optimal number of clusters via WCSS analysis. A sharp turn in the graph should indicate a good value for the cut.
set.seed(123)
k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})
qplot(x = 1:k_max, y = twcss, geom = 'line')
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
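The “sharp turn” can also be located numerically. A simple rule of thumb (my own made-up heuristic, not a method from the course) is to take the first k after which the WCSS drop becomes small relative to the first drop:

```r
# Made-up WCSS values for k = 1..5, shaped like a typical elbow curve
twcss_toy <- c(100, 40, 35, 32, 30)
drops <- -diff(twcss_toy)                    # 60, 5, 3, 2
elbow_k <- which(drops / drops[1] < 0.2)[1]  # first k whose next drop is small
elbow_k  # 2: adding a third cluster barely helps
```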
The optimal number of clusters seems to be 2. Let’s rerun the k-means analysis with 2 clusters.
set.seed(13)
km <- kmeans(boston_scaled, centers = 2)
pairs(boston_scaled, col = km$cluster)
Not very informative due to the small plot size. Let’s take a closer look at the variables we noted as correlated earlier: crim, zn, nox, dis, rad, lstat and medv.
pairs(boston_scaled[, c(1, 2, 5, 8, 9, 13, 14)], col = km$cluster)
Some of these might still be dropped, but on the whole it seems that these two clusters capture meaningful and significant differences between two groups of observations in the data. The commercial and economic centers have smaller lots, poorer air quality, better access to radial highways, more poor people, and less expensive homes. Further analysis would be needed to elaborate.
Just not enough time. Being ill with a small baby has its drawbacks.
Yet, this seems intriguing. Who doesn’t want more dimensions? Gotta try, I’ll give it “15 minutes”, and just list the instructions here as I go through them. Not for the points, but for the coolness factor of 3D thingies.
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
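The dimensions check out as (404 × 13) %*% (13 × 3) → 404 × 3: one coordinate per observation per discriminant. A toy check of the same matrix multiplication rule:

```r
A <- matrix(1:6, nrow = 2)  # 2 x 3
B <- matrix(1:6, ncol = 2)  # 3 x 2
dim(A %*% B)                # 2 x 2: the inner dimensions (3) must match and cancel
```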
# install.packages("plotly")
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color=classes)
It is beautiful. Oh wait, I can move it! C O O L
# plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color=km$cluster)
Okey dokey. This produces an error, probably because km$cluster is not part of the data matrix we created: km was run on the full scaled dataset (506 rows), while the matrix product has only the 404 training rows. I should run k-means on the training set, or join the cluster column to it, and then create the data matrix and the 3D plot again… but this time I think I will just go to bed and sleep.
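For the record, a sketch of the fix described above, written as a note to self. It rebuilds a comparable all-numeric 80 % split from scratch (assumption: the exact training rows differ from the ones above, since no seed was set before the original sample() call) and runs k-means on the training rows only, so the cluster labels line up with the rows of the matrix product:

```r
library(MASS)  # for the Boston data
boston_num <- as.data.frame(scale(Boston))  # all-numeric, so kmeans() accepts it
n <- nrow(boston_num)
set.seed(13)
ind <- sample(n, size = floor(n * 0.8))
train_num <- boston_num[ind, ]
# k-means on the training rows only: one cluster label per training row,
# which is exactly what plot_ly's color argument needs
km_train <- kmeans(train_num, centers = 2)
length(km_train$cluster) == nrow(train_num)  # TRUE
# plot_ly(..., color = km_train$cluster) would now work
```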
Peace out.
(more chapters to be added similarly as we proceed with the course!)